Climate-Study in Switzerland

Monthly averages of temperature and humidity of 14 wheather stations in Switzerland are available here. It reaches way back to 1867. Let's see what we can learn about

seasons
climate differences accross Switzerland
maybe even climate change.



In [1]:

    
%matplotlib inline

import numpy as np
import pandas as pd
import matplotlib.pyplot as pl
import statsmodels.formula.api as sm
import patsy as pt
from statsmodels.graphics.gofplots import qqplot
from datetime import datetime



In [2]:

    
import zipfile
from urllib.request import urlopen

# Get remote data:
url = 'http://data.geo.admin.ch/ch.meteoschweiz.homogenereihen/VQAA60.csv'

# Read into DataFrame:
data = pd.read_csv( urlopen(url), sep='|',
            skiprows=[0,1,2,3],
            na_values=['-'],
            names=['Station', 'Time', 'H', 'T'],
            parse_dates=['Time'], 
            date_parser=lambda d: pd.to_datetime(d, format='%Y%m') )

# statsmodels does not work so well with nans. Get rid of them:
data.dropna( inplace=True )

Data Inspection

First, let's have a look at the data we downloaded:



In [3]:

    
data[:6]

Ok, we have an integer index, station names (abreviated), and the time of measurement. The actual data columns are H (rain fall in millimeters) and T (temperature in $^{o}$C).

Let's create a hierarchical index using Station and Time. Furthermore, I want to know the month (for later fitting of the season), the year, and the time since the start of the measurements:



In [4]:

    
data['Month'] = pd.DatetimeIndex(data['Time']).month
data['Year'] = pd.DatetimeIndex(data['Time']).year
dt = data['Time'] - data['Time'].min()
data['dt'] = dt / pd.Timedelta(1, 'Y')

# Drop station 'PAY' as all data there is NaN:
data = data[data['Station'] != 'PAY']

# Create 2-level index (Station, Time):
data = data.set_index(pd.MultiIndex.from_arrays( 
                                 [data['Station'], data['Time']]),
                                 drop=False)

data[:6][['H', 'T', 'Month', 'Time']]

The data for 'BAS' (probably Basel) is shown below. As one would expect, the temperature is cyclic. Summer is warmer than winter. The precipitation looks mainly noisy. Taking means for each month (see table), however, hints that there is systematically more precipitation in summer than in winter.



In [5]:

    
fig, ax1 = pl.subplots()

pl.title('Temperature in BAS')
ax1.set_xlabel('Date')
ax1.set_ylabel('Temperature ($^{o}$ C)', color='r')
ax1.plot_date(data.index.levels[1].astype(datetime), data.loc['BAS']['T'],
              r'ro-', label='T')

ax2 = ax1.twinx()
ax2.set_ylabel('Rainfall (mm)', verticalalignment='bottom',
               rotation=-90, color='b')
ax2.plot_date(data.index.levels[1].astype(datetime), data.loc['BAS']['H'],
              r'bo-', label='H')

ax1.set_xlim(['1867', '1877'])

data.groupby('Month').aggregate(['mean', 'std'])[['H', 'T']]

Taking means by stations tells the warmer from the colder and the wetter from the drier locatiats:



In [6]:

    
data.groupby('Station').aggregate( ['mean', 'std'] )[ ['H', 'T'] ]

Seasons and Geographical Differences

The temperature is seasonal and depends on the station. So lets fit these dependencies.



In [7]:

    
# Helpers
# %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

def _extractLevels(categoricalVars):
    out = [name.split('[')[1][:-1] for name in categoricalVars]
    return(out)

def extractLevels(categoricalVarNames):
    """
    Extracts the level (or levels, in case of interaction terms from a list
    of column names (of a design matrix created by patsy).
    
    Example:
    
    input: [ 'C(Month)[T.2]:Station[T.BER]', 'C(Month)[T.3]:Station[T.BER]' ]
    output: [ ('2', 'BER'), ('3', 'BER') ]
    """
    
    out = [name.split(':') for name in categoricalVarNames]
    out = [_extractLevels(name) for name in out]
    return(out)
    

# Season:Station
# %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

data.dropna(inplace=True)

y, X = pt.dmatrices('T ~ C(Month):Station - 1', data=data)
model = sm.OLS(y, X)
fit = model.fit()
# print(fit.summary())

interactionSlice = X.design_info.term_name_slices['C(Month):Station']
interactionTemp = fit.params[interactionSlice]
interactionNames = X.design_info.column_names[interactionSlice]
interactionNames = extractLevels(interactionNames)
interactionNames = list(zip(*interactionNames))

data['interaction'] = data['Month'].astype('str') + ':' + data['Station']

pl.figure()
pl.title('Interaction Temperature-Deviation from January in BAS')
pl.xlabel('Month (Nr.)')
pl.ylabel('Temperature Deviation ($^{o}$ C)')

effects = fit.params.reshape(-1, 12)
pl.plot(effects.T, r'o-')

pl.legend(interactionNames[1][::12]);

Having fitted the obvious dependencies, we now - hopefully - have a clearer look at the non-obvious stuff. Let's diagnose the validity of the regression, first (see plots below):

The Q-Q-plot indicates a tail towards negative residuals
Residuals vs Month and Station are unsuspicious
Residuals vs time indicate an unmodeled dependency (temperature seems to increase over the century)
Residuals vs fitted values shows a somewhat larger variance at temperatures below ~2$^o$C



In [8]:

    
data['fit'] = fit.predict(exog=X)
data['resid'] = data['T'] - data['fit']

fig, ax = pl.subplots()
qqplot(data['resid'], ax=ax)

pl.figure()
pl.title('Residuals vs Month');
pl.plot(data['Month'], data['resid'], r'o')
pl.xlabel('Month (1)');
pl.ylabel('Residual ($^o$C)');

pl.figure()
pl.title('Residuals vs Station');
residStation = data.pivot(index='Time', columns='Station', values='resid')
pl.boxplot(np.asarray(residStation))
pl.xticks(range(1, len(residStation.columns)+1), residStation.columns)
pl.xlabel('Station');
pl.ylabel('Residual ($^o$C)');

pl.figure()
pl.plot_date(data['Time'].astype(datetime), data['resid'], r'o')
pl.xlabel('Time (years)');
pl.ylabel('Residual ($^o$C)');

pl.figure()
pl.plot(data['fit'], data['resid'], r'o')
pl.xlabel('Fitted Values ($^{o}$ C)');
pl.ylabel('Residual ($^o$C)');

Climate Change

Now we re-fit the data with a model that additionally includes a linear temperature trend. We find that the additional fit-parameter improves the fit and is significanty different from zero:



In [9]:

    
modelClimateChange = sm.ols('T ~ C(Month):Station + dt - 1', data=data)
fitCC = modelClimateChange.fit()
#tuple(fitCC.conf_int().loc['dt'])
'Temperature Drift: ({0:.2f} +/- {1:.2f}) K/century (95% Conf.)'.format(
                            fitCC.params['dt']*1e2, fitCC.bse['dt']*1e2)









    Out[9]:





'Temperature Drift: (1.35 +/- 0.03) K/century (95% Conf.)'

	Station	Time	H	T
0	BAS	1867-01-01	140.9	-0.2
1	BAS	1867-02-01	66.9	5.7
2	BAS	1867-03-01	70.8	4.9
3	BAS	1867-04-01	92.5	10.0
4	BAS	1867-05-01	73.9	13.6
5	BAS	1867-06-01	140.6	16.1

		H	T	Month	Time
Station	Time
BAS	1867-01-01	140.9	-0.2	1	1867-01-01
	1867-02-01	66.9	5.7	2	1867-02-01
	1867-03-01	70.8	4.9	3	1867-03-01
	1867-04-01	92.5	10.0	4	1867-04-01
	1867-05-01	73.9	13.6	5	1867-05-01
	1867-06-01	140.6	16.1	6	1867-06-01

	H		T
	mean	std	mean	std
Month
1	83.541880	78.870473	-2.791613	3.985315
2	76.770699	75.877096	-1.747151	4.510225
3	86.466703	75.698482	1.391286	4.805732
4	97.498870	75.366269	4.980796	5.069517
5	113.130931	72.567541	9.309736	5.001985
6	128.720710	73.514679	12.681926	4.971962
7	130.652366	79.420809	14.728925	4.937872
8	134.391237	79.178608	14.160968	4.635340
9	110.250134	76.685653	11.091134	4.348568
10	106.498763	86.235033	6.497473	3.957268
11	101.200269	89.541348	1.626667	3.797036
12	95.625767	86.927008	-1.742711	3.868841

	H		T
	mean	std	mean	std
Station
BAS	66.149469	39.326102	9.458021	6.550368
BER	83.489324	48.481764	7.869424	6.855973
CHD	109.260498	58.977209	5.366891	6.493301
CHM	101.421968	57.145141	5.460928	6.113159
DAV	81.542727	51.780745	2.595568	6.438817
ENG	129.585020	66.342045	5.530520	6.447743
GSB	183.151774	101.087606	-1.264591	5.872440
GVE	77.474455	48.267088	9.595975	6.741932
LUG	132.004192	100.705848	11.701174	6.687905
SAE	218.056590	120.579510	-2.033042	5.261947
SIA	81.794966	58.360747	1.611521	6.677687
SIO	47.418893	35.199694	9.074399	7.321170
SMA	88.480324	51.246788	8.428675	6.779093